
Initial version of aarch64 container with Vulkan #270

Open

sroecker wants to merge 1 commit into main from add_aarch64_vulkan_container
Conversation

@sroecker commented Oct 9, 2024

Initial version of an aarch64 container with Vulkan support that runs in libkrun containers on macOS.
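To try it locally once the image is built, something along these lines should work; the image tag, Containerfile path, and device node below are placeholders rather than part of this PR:

podman build -t ramalama-vulkan-aarch64 -f container-images/vulkan/Containerfile .
podman run --rm -it --device /dev/dri ramalama-vulkan-aarch64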

@ericcurtin (Collaborator)

I think we should merge this... But we also have a Vulkan image based on Kompute... It's all about naming... Please work with @slp and @rhatdan to agree on names...

@ericcurtin (Collaborator)

@sroecker please sign your commit; it is failing the DCO check. Run

git commit --amend -s

to sign the existing commit.
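If the commit has already been pushed, force-push the branch afterwards, for example (assuming the remote is named origin):

git push --force-with-lease origin add_aarch64_vulkan_container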

@sroecker force-pushed the add_aarch64_vulkan_container branch from 40672af to 3e41c4e on October 9, 2024 at 16:00
@rhatdan (Member) commented Oct 9, 2024

I would prefer these all to be based on a common base image containing all of the Python tools required to run ramalama, so that the ROCm, Vulkan, ... images can all share the lower layer.
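A rough sketch of what that layering could look like; the image names, package names, and paths below are placeholders, not the actual ramalama images:

# Containerfile.base -- hypothetical shared layer with the Python tooling
FROM quay.io/fedora/fedora:40
RUN dnf install -y python3 python3-pip && dnf clean all
COPY . /src
RUN pip3 install /src

# Containerfile.vulkan -- hypothetical accelerator layer on top of the shared base
FROM localhost/ramalama-base:latest
RUN dnf install -y mesa-vulkan-drivers vulkan-loader vulkan-tools && dnf clean all

The ROCm image would follow the same pattern, swapping the Vulkan packages for ROCm ones.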

@slp (Collaborator) commented Oct 9, 2024

@sroecker is llama.cpp working properly for you with a container built from this Containerfile? Which models have you tested?

I'm asking because the Vulkan backend hasn't worked for me since March, which is why I started favoring the Kompute backend (which also uses Vulkan).

@sroecker (Author) commented Oct 9, 2024

> @sroecker is llama.cpp working properly for you with a container built from this Containerfile? Which models have you tested?
>
> I'm asking because the Vulkan backend hasn't worked for me since March, which is why I started favoring the Kompute backend (which also uses Vulkan).

I had to test a smaller model due to machine constraints:
https://huggingface.co/MaziyarPanahi/SmolLM-1.7B-Instruct-GGUF/blob/main/SmolLM-1.7B-Instruct.Q4_K_M.gguf
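A representative invocation; the binary name, model path, prompt, and GPU-layer count here are placeholders rather than the exact command used:

llama-cli -m SmolLM-1.7B-Instruct.Q4_K_M.gguf -ngl 99 -p "Fibonacci"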

ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: Virtio-GPU Venus (Apple M1 Pro) (venus) | uma: 1 | fp16: 1 | warp size: 32
llm_load_tensors: ggml ctx size =    0.20 MiB
llm_load_tensors: offloading 24 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 25/25 layers to GPU
llm_load_tensors: Virtio-GPU Venus (Apple M1 Pro) buffer size =  1005.01 MiB
llm_load_tensors:        CPU buffer size =    78.75 MiB
...................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: Virtio-GPU Venus (Apple M1 Pro) KV buffer size =   384.00 MiB
llama_new_context_with_model: KV self size  =  384.00 MiB, K (f16):  192.00 MiB, V (f16):  192.00 MiB
llama_new_context_with_model: Vulkan_Host  output buffer size =     0.19 MiB
llama_new_context_with_model: Virtio-GPU Venus (Apple M1 Pro) compute buffer size =   148.00 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size =     8.01 MiB
llama_new_context_with_model: graph nodes  = 774
llama_new_context_with_model: graph splits = 2
llama_init_from_gpt_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 5

system_info: n_threads = 5 (n_threads_batch = 5) / 5 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

sampler seed: 3671259997
sampler params:
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> temp-ext -> softmax -> dist
generate: n_ctx = 2048, n_batch = 2048, n_predict = -1, n_keep = 0

Fibonacci: The Fibonacci sequence is a classic example of a recurrent sequence that can be used to model various phenomena, such as the growth of populations or the stock market.

In summary, recurrences are essential for modeling dynamic systems and capturing the underlying patterns and behaviors of these systems over time. [end of text]


llama_perf_sampler_print:    sampling time =       1.97 ms /    63 runs   (    0.03 ms per token, 32061.07 tokens per second)
llama_perf_context_print:        load time =    1363.18 ms
llama_perf_context_print: prompt eval time =     614.52 ms /     4 tokens (  153.63 ms per token,     6.51 tokens per second)
llama_perf_context_print:        eval time =    1134.46 ms /    58 runs   (   19.56 ms per token,    51.13 tokens per second)
llama_perf_context_print:       total time =    1755.12 ms /    62 tokens

I can check with the Kompute backend tomorrow.

@slp (Collaborator) commented Oct 9, 2024

Tested with Mistral-7B and Wizard-Vicuna-13B and got random answers with both of them. Sadly, the Vulkan backend is still broken for Apple Silicon GPUs upstream.

I think we're going to need to stick with the Kompute backend for a while, as implemented in #235.
